
1 Goal and hypothesis

We set out to examine whether the following general questions can be answered:

2 Why CitiBike trip data

Here are the main reasons for selecting the CitiBike trip data:

  1. Throughout the semester the course introduced many data science methods, e.g. regression, clustering, time-series analysis, NLP, and network analysis. We were looking for a data set that would allow us to try as many of these methods as possible.
  2. The context covered by the data set should be familiar to students, so that solutions to its challenges can be shared and have actual meaning in real life.
  3. The data set should be modest in volume (we do not require big data) and reasonably clean, so that we can spend most of our time selecting appropriate methods, tuning algorithm parameters, and examining the results to generate interpretations that make sense.
  4. The geographical information in the CitiBike data correlates well with the NYC Taxi data.

3 Hypothesis

4 Tech details before running the RMD

4.1 Compile this RMD

Run the following command in terminal:

Rscript -e "rmarkdown::render('FinalProject_FDS.Rmd')"

4.2 Set-up and dependencies

WORKDIR <- getwd()
FIGDIR <- file.path(WORKDIR, 'fig')
SRCDIR <- file.path(WORKDIR, 'src')
DATADIR <- file.path(WORKDIR, 'data')
if (!dir.exists(FIGDIR)) dir.create(FIGDIR)
if (!dir.exists(SRCDIR)) dir.create(SRCDIR)
if (!dir.exists(DATADIR)) dir.create(DATADIR)

library('dplyr')
library('reshape2')
library('ggplot2')
library('lubridate')
library('readr')
library('scales')
library('cowplot')
library('forecast')
library('pheatmap')
library('ggmap')
library('igraph')
library('NbClust')

#theme_set(theme_bw(base_size = 12))
myGthm <- theme(text = element_text(size = 15),
                legend.position='bottom')

5 Download or import data

The CitiBike data source is https://s3.amazonaws.com/tripdata/index.html. The NYC Taxi data source is http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. Here we provide bash scripts to download and unzip the data.

If you would like to run from scratch, run the following commands in a terminal:

# go to work-directory of project (same as this RMD file)
cd . 
# create data folder
mkdir -p data 
cd ./src
# You may want to modify the sh file to include/exclude data sets
sh get_data.sh
sh get_data_taxi.sh

5.1 CitiBike data import

If successful, the CitiBike trip data set is available in the data folder to be read in.

infile <- list.files(path=file.path(DATADIR), pattern='*-citibike-tripdata.csv',
                     full.names = T)
data <- bind_rows(lapply(infile, function(x) {
  read.csv(x, stringsAsFactors=FALSE)}))

5.2 NYC Taxi data import

use.RDS.for.taxi  <- TRUE
RDS.name.for.taxi <- 'taxi.RDS'

trip.data <- 
  if(use.RDS.for.taxi){
    readRDS(file.path(DATADIR, RDS.name.for.taxi))
  }else{
    infile.taxi <- list.files(path=file.path(DATADIR), pattern='\\S*_tripdata_\\d{4}-\\d{2}\\.csv',
                              full.names = T)
    bind_rows(lapply(infile.taxi, function(x) {
                                read.csv(x, stringsAsFactors=FALSE)}))
  }
head(trip.data)
##   VendorID lpep_pickup_datetime Lpep_dropoff_datetime Store_and_fwd_flag
## 1        2  2016-06-01 02:46:38   2016-06-01 03:06:40                  N
## 2        2  2016-06-01 02:55:26   2016-06-01 03:06:52                  N
## 3        2  2016-06-01 02:50:36   2016-06-01 03:08:39                  N
## 4        2  2016-06-01 02:57:04   2016-06-01 03:07:52                  N
## 5        2  2016-06-01 02:52:03   2016-06-01 03:08:12                  N
## 6        2  2016-06-01 02:59:03   2016-06-01 03:09:25                  N
##   RateCodeID Pickup_longitude Pickup_latitude Dropoff_longitude
## 1          1        -73.93058        40.69518         -74.00005
## 2          1        -73.94693        40.79255         -73.95157
## 3          1        -73.94453        40.82396         -73.99466
## 4          1        -73.95221        40.82387         -73.91436
## 5          1        -73.95798        40.71783         -73.95402
## 6          1        -73.96532        40.71103         -73.98969
##   Dropoff_latitude Passenger_count Trip_distance Fare_amount Extra MTA_tax
## 1         40.72905               1          5.24        19.5   0.5     0.5
## 2         40.82516               1          3.14        11.5   0.5     0.5
## 3         40.75042               1          7.50        23.5   0.5     0.5
## 4         40.81470               1          2.27        10.5   0.5     0.5
## 5         40.65512               3          4.90        16.5   0.5     0.5
## 6         40.71417               1          2.76        11.0   0.5     0.5
##   Tip_amount Tolls_amount Ehail_fee improvement_surcharge Total_amount
## 1       6.24            0        NA                   0.3        27.04
## 2       2.56            0        NA                   0.3        15.36
## 3       2.00            0        NA                   0.3        26.80
## 4       0.00            0        NA                   0.3        11.80
## 5       0.00            0        NA                   0.3        17.80
## 6       2.46            0        NA                   0.3        14.76
##   Payment_type Trip_type
## 1            1         1
## 2            1         1
## 3            1         1
## 4            2         1
## 5            1         1
## 6            1         1

5.3 RDS import for CitiBike

Alternatively, I have attached an RDS file that contains exactly the same data as above. Load the RDS file if you want to skip the download and save time on importing the data.

data <- readRDS(file.path(DATADIR, 'citibike.rds'))
# data <- read.csv(infile[1], stringsAsFactors = F)
# data <- sample_frac(data, 0.1)
print(head(data))
##   tripduration     starttime      stoptime start.station.id
## 1         1346 1/1/2015 0:01 1/1/2015 0:24              455
## 2          363 1/1/2015 0:02 1/1/2015 0:08              434
## 3          346 1/1/2015 0:04 1/1/2015 0:10              491
## 4          182 1/1/2015 0:04 1/1/2015 0:07              384
## 5          969 1/1/2015 0:05 1/1/2015 0:21              474
## 6          496 1/1/2015 0:07 1/1/2015 0:15              512
##        start.station.name start.station.latitude start.station.longitude
## 1         1 Ave & E 44 St               40.75002               -73.96905
## 2         9 Ave & W 18 St               40.74317               -74.00366
## 3    E 24 St & Park Ave S               40.74096               -73.98602
## 4 Fulton St & Waverly Ave               40.68318               -73.96596
## 5         5 Ave & E 29 St               40.74517               -73.98683
## 6         W 29 St & 9 Ave               40.75007               -73.99839
##   end.station.id            end.station.name end.station.latitude
## 1            265    Stanton St & Chrystie St             40.72229
## 2            482             W 15 St & 7 Ave             40.73936
## 3            505             6 Ave & W 33 St             40.74901
## 4            399 Lafayette Ave & St James Pl             40.68852
## 5            432           E 7 St & Avenue A             40.72622
## 6            383  Greenwich Ave & Charles St             40.73524
##   end.station.longitude bikeid   usertype birth.year gender
## 1             -73.99148  18660 Subscriber       1960      2
## 2             -73.99932  16085 Subscriber       1963      1
## 3             -73.98848  20845 Subscriber       1974      1
## 4             -73.96476  19610 Subscriber       1969      1
## 5             -73.98380  20197 Subscriber       1977      1
## 6             -74.00027  20788 Subscriber       1969      2

There are 3 categories of features:

  1. Geographic information: where each trip starts and ends, i.e. the station names and longitude/latitude.
  2. Time information: when each trip starts and ends.
  3. User information: age, gender, and whether the rider is a member or a one-time customer.

6 Data formatting

6.1 Time

The time information in the raw data should be converted to a time format recognized by R.

data$startTimestamp <- as.POSIXct(strptime(data$starttime, '%m/%d/%Y %H:%M',
                                           tz = 'EST5EDT'))
data$stopTimestamp  <- as.POSIXct(strptime(data$stoptime, '%m/%d/%Y %H:%M',
                                           tz = 'EST5EDT'))
data$startweekday <- factor(weekdays(data$startTimestamp),
                            levels= c("Sunday", "Monday","Tuesday",
                                      "Wednesday", "Thursday", "Friday",
                                      "Saturday"))
data$stopweekday <- factor(weekdays(data$stopTimestamp),
                            levels= c("Sunday", "Monday","Tuesday",
                                      "Wednesday", "Thursday", "Friday",
                                      "Saturday"))
data$startHr <- format(data$startTimestamp, '%H')
data$stopHr <- format(data$stopTimestamp, '%H')

6.2 Longitude / Latitude

start_lat_min <- min(data$start.station.latitude)
start_lat_max <- max(data$start.station.latitude)
start_lon_min <- min(data$start.station.longitude)
start_lon_max <- max(data$start.station.longitude)

plot_lat_bt  <- start_lat_min - 2
plot_lat_up  <- start_lat_max + 2
plot_lon_lft <- start_lon_min - 2
plot_lon_rit <- start_lon_max + 2

Next, all available stations are visualized on the map.

stations_start <- dplyr::select(data, starts_with('start.station')) %>% unique()
stations_end <- dplyr::select(data, starts_with('end.station')) %>% unique()
colnames(stations_start) <- colnames(stations_end) <- c('STATION_ID',
                                                        'STATION_NAME',
                                                        'STATION_LAT',
                                                        'STATION_LON')
stations <- rbind(stations_start, stations_end) %>% unique()
rownames(stations) <- stations$STATION_NAME

# reorder columns as stations_name should be first to create igraph
stations <- dplyr::select(stations, STATION_NAME, STATION_ID,
                          STATION_LAT, STATION_LON)

map_stations_loc <- function(df,
                             plot_lat_bt=38.68034, plot_lat_up=42.77152,
                             plot_lon_lft=-76.01713, plot_lon_rit=-71.95005){
  ## show stations on maps
  # x is data frame for stations info
  q <- qmplot(x=STATION_LON, y=STATION_LAT,
              data = df,
              maptype = 'toner-lite',
              extent = 'device',
              zoom = 14,
              color=I('red'), alpha = I(.7))
  return(q)
}

p_all_stations <- map_stations_loc(stations)
print(p_all_stations)

7 Visualizing trips to infer hidden pattern of biker preference

Before applying methods to analyze the data, it is essential to look at the data. My philosophy is that if something plays an important role, then either it or its consequences should have a high chance of being observed. Therefore, to investigate whether there exist hidden patterns of human activity, it is worthwhile to visualize the data set first and explore the properties behind it.

q <-
  qmplot(start.station.longitude, start.station.latitude,
         data = data, maptype = "toner-lite",
         geom = "blank", zoom = 14,
         legend = "right") +
  ggtitle('Pick-up') +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2("Activities",
                       low = "ghostwhite", mid = "yellow", high = "red")

7.1 Picking activity by Usertype (One-time Customer vs. Member)

## Only Usertype (2 types)
q1 <- q + facet_wrap( ~ usertype)
print(q1)

7.2 Picking activity of Week ~ Usertype

q2 <- q + facet_wrap(startweekday ~ usertype, ncol=8)
print(q2)

7.3 Picking activity by hour for Members during weekdays only

subscriber <- data[data$usertype == 'Subscriber', ]

## One-day at NYC (weekdays and weekends)
weekdays_idx <- subscriber$startweekday %in% c("Monday","Tuesday",
                                               "Wednesday", "Thursday", "Friday")
weekends_idx <- !weekdays_idx
subscriberWday <- subscriber[weekdays_idx, ]
subscriberWend <- subscriber[weekends_idx, ]
qSubWday <-
  qmplot(start.station.longitude, start.station.latitude,
         data = subscriberWday, maptype = "toner-lite",
         geom = "blank", zoom = 14,
         legend = "right") +
  stat_density_2d(aes(fill = ..level..),
                  geom = "polygon", alpha = .3, color = NA) +
  scale_fill_gradient2("Pick-Up Activities",
                       low = "ghostwhite", mid = "yellow", high = "red")

qSubWday2 <- qSubWday + facet_wrap(~startHr, ncol=8)
print(qSubWday2)

7.4 Conclusions from visualization

In the first figure, where picking activity over all times is displayed for Customer vs. Member, it can be seen that one-time customers pick up bikes without clear preferences, while members have enriched regions, for example the station at 8 Ave & W 31 St, which is closest to Penn Station. This implies the existence of specific reasons for using CitiBike: members tend to ride for daily commuting while one-time customers ride for fun.

Furthermore, the second figure, which adds a comparison across weekdays, further supports this implication: the highlighted regions of members are stable, while one-time customers remain present everywhere.

Finally, by increasing the temporal resolution to hourly, the exact hours when member activity is enriched can be found. In the early morning (6, 7, 8 AM, i.e. the morning rush hour) the enriched regions fire up; they fade away during the day until nightfall. At nightfall (5, 6 PM, i.e. the evening rush hour), the enriched regions show up again.

In sum, two messages are inferred from the data visualization:

  1. There are hidden patterns in how people use CitiBike.
  2. Due to these hidden patterns of human activity, each station has its own temporal profile. For example, stations have different rush hours in terms of picking and docking.

8 Identify subgroups of stations with distinct temporal profiles

8.1 Convert long-format data into wide-format

Each row in the raw data is a trip, so the raw data set is in long format. In order to create temporal profiles of stations, the long format should be summarized into wide format.

set.seed(1234)
## Long-format to wide-format of citybike
citibike_long2wide <- function(data, activity_type = c('pick', 'dock')){
  if (activity_type == 'pick') {
    station_24hr_long <- dplyr::select(data, start.station.name, startHr)
  } else if (activity_type == 'dock') {
    station_24hr_long <- dplyr::select(data, end.station.name, stopHr)
  } else {
    stop('Error: activity type is either pick or dock.')
  }
  colnames(station_24hr_long) <- c('NAME', 'HOUR')
  station_24hr_long <- group_by(station_24hr_long, NAME, HOUR) %>%
    summarise(TRIPS=n())
  station_24hr_wide <- dcast(station_24hr_long, NAME~HOUR,
                             value.var = 'TRIPS', sum, fill=0)
  rownames(station_24hr_wide) <- station_24hr_wide$NAME
  station_24hr_wide <- station_24hr_wide[, -1]
  # print(head(station_24hr_wide))
  return(station_24hr_wide)
}
pick_station_24hr <- citibike_long2wide(data=data, activity_type='pick')
dock_station_24hr <- citibike_long2wide(data=data, activity_type='dock')

The original data set is summarized into two big matrices: one for picking, the other for docking. Each row is a station and each column is an hour of the day. An example of the wide-format matrix of pick activity:

print(head(pick_station_24hr[, 1:6]))
##                  00 01 02 03 04 05
## 1 Ave & E 15 St  46 22 21  8  6 22
## 1 Ave & E 18 St   9  1  0  0  1 13
## 1 Ave & E 30 St  14  3  9  2  1  2
## 1 Ave & E 44 St   5  0  0  0  0  3
## 10 Ave & W 28 St 13  8 12 10  2 16
## 11 Ave & W 27 St  4  1  5  2  2 10

8.2 Visualize temporal profile of stations

To visualize the matrix, I selected the top 20 stations with the most picking activity as an example.

orderbyRowSum <- function(data){
  o <- mutate(data, SUM=rowSums(data), row_names=rownames(data)) %>%
    dplyr::arrange(desc(SUM), row_names) %>%
    dplyr::select(-SUM)
  rownames(o) <- o$row_names
  o <- dplyr::select(o, -row_names)
  return(o)
}
top_N <- 20
pick_station_24hr_desc <- orderbyRowSum(pick_station_24hr)
dock_station_24hr_desc <- orderbyRowSum(dock_station_24hr)
top_pick_station_names <- rownames(pick_station_24hr_desc)[seq_len(top_N)]
top_dock_station_names <- rownames(dock_station_24hr_desc)[seq_len(top_N)]
desc_pick_station_names <- rownames(pick_station_24hr_desc)
desc_dock_station_names <- rownames(dock_station_24hr_desc)
pheatmap(pick_station_24hr_desc[top_pick_station_names, ],
         cluster_rows = F, cluster_cols = F,
         main='Picking activities of Stations with Top 24-hr Picking Activities')

pheatmap(dock_station_24hr_desc[top_pick_station_names, ],
         cluster_rows = F, cluster_cols = F,
         main='Docking activities of Stations with Top 24-hr Picking Activities')


Comparing the two matrices side by side, stations have different rush hours:

  • High picking in early morning and high docking at nightfall (e.g. 8 Ave & W 31 St)
  • Stable activities throughout the day (e.g. Broadway & E 14 St)

8.3 P/D Index

It can be inferred that other patterns exist, but it is not straightforward to draw conclusions from the two matrices above. Therefore, a metric is developed to integrate the two sources of information.

Here is how the P/D Index is calculated:

  1. The inputs are the picking matrix and the docking matrix, with rows as stations.
  2. For each matrix, perform steps 3 to 5 for normalization.
  3. For each row, compute the maximum.
  4. Divide each row's values by that maximum.
  5. Add one to all values of the matrix (to avoid division by zero).
  6. Divide the normalized picking matrix by the normalized docking matrix to generate the P/D index matrix.

row_norm_byMax <- function(x){
  t(apply(x, 1, function(r) {
    r/max(r)
  }))
}
pd_station_24hr <- (row_norm_byMax(pick_station_24hr)+1) / (row_norm_byMax(dock_station_24hr)+1)
pd_station_24hr <- as.data.frame(pd_station_24hr)

The intuition of the P/D index is that a high P/D index suggests one of the following three situations:

  1. Higher picking activity.
  2. Lower docking activity.
  3. Both 1 and 2 at the same time.
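
As a sanity check, the computation can be traced on a tiny hypothetical example (the counts below are made up, not taken from the data):

```r
# Toy picking/docking counts: rows = 2 stations, cols = 3 hours
pick <- matrix(c(10, 5, 1,
                  2, 2, 2), nrow = 2, byrow = TRUE)
dock <- matrix(c( 1, 5, 10,
                  2, 2, 2), nrow = 2, byrow = TRUE)

# Steps 3-4: divide each row by its max; step 5: add one (avoids division by zero)
row_norm_byMax <- function(x) t(apply(x, 1, function(r) r / max(r)))
pd <- (row_norm_byMax(pick) + 1) / (row_norm_byMax(dock) + 1)

round(pd, 2)
# Station 1: 1.82 1.00 0.55  -> picks dominate early, docks dominate late
# Station 2: 1.00 1.00 1.00  -> balanced station, P/D index is 1 everywhere
```

A P/D index above 1 thus flags hours when a station empties out faster than it refills, and below 1 the reverse.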

8.4 Infer subgroups of stations by clustering on P/D index matrix

The P/D index of the same stations as shown before is displayed in the following heat-map.

pheatmap(pd_station_24hr[top_pick_station_names, ],
         cluster_rows = F,
         cluster_cols = F,
         main='P/D Index of Stations with Top 24-hr Picking Activities')

print('P/D Index has average:')
## [1] "P/D Index has average:"
print(sum(pd_station_24hr)/prod(dim(pd_station_24hr)))
## [1] 1.0313

Now it is much more straightforward to infer possible subgroups of stations. I applied K-means to all ~300 stations to find hidden subgroups of stations in terms of their temporal profiles.

# K chosen manually as 10; see the model-evaluation section for the silhouette-based check
K <- 10
p_pd_kmeans <- pheatmap(pd_station_24hr,
                        kmeans_k = K,
                        cluster_cols = F,
                        main='P/D Index of Stations')


Therefore, there are at least 3 types of stations by their 24-hr P/D index:

  • Type-A: high in the morning & low in the evening: home-like stations
  • Type-B: low in the morning & high in the evening: company-like stations
  • Type-C: stable throughout the day: no special usage
  • Type-X: others

rush_hr_morning_lab <- c('06', '07', '08')
rush_hr_evening_lab <- c('16', '17', '18')
# Manually extract clusters of interest
# double-check before continue
# pheatmap(p_pd_kmeans$kmeans$centers, cluster_cols = F)
pd_kmeans_center <- p_pd_kmeans$kmeans$centers
pd_kmeans_cluster <- p_pd_kmeans$kmeans$cluster
## Type-1
stations_type1_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(5, 9)])
p_station_type1 <- dplyr::filter(stations, STATION_NAME %in% stations_type1_names) %>%
  map_stations_loc()
## Type-2
stations_type2_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(4, 8, 10)])
p_station_type2 <- dplyr::filter(stations, STATION_NAME %in% stations_type2_names) %>%
  map_stations_loc()
## Type-3
stations_type3_names <- names(pd_kmeans_cluster[pd_kmeans_cluster %in% c(2, 6)])
p_station_type3 <- dplyr::filter(stations, STATION_NAME %in% stations_type3_names) %>%
  map_stations_loc()
## Type-4 unknown: the rest stations
stations_types_summary <- rep('X', NROW(stations))
stations_types_summary[stations$STATION_NAME %in% stations_type1_names] <- LETTERS[1]
stations_types_summary[stations$STATION_NAME %in% stations_type2_names] <- LETTERS[2]
stations_types_summary[stations$STATION_NAME %in% stations_type3_names] <- LETTERS[3]

8.5 Visualize subgroups of stations on map

stations_byTypes <- dplyr::mutate(stations, TYPE=stations_types_summary)
p_stations_byTypes <- qmplot(x=STATION_LON, y=STATION_LAT,
                             data = stations_byTypes,
                             maptype = 'toner-lite',
                             extent = 'device',
                             zoom = 14,
                             color=TYPE,
                             size=I(2.5))
print(p_stations_byTypes)

8.6 Summary of subgroup of stations

First, I developed P/D index to integrate the picking and docking activities as a straightforward metric to reflect the temporal profile of stations.

Second, by performing clustering on hourly P/D index matrix of stations, the stations with different hourly activities formed several distinct subgroups. Among them, the Type-A is likely to be home-like stations, and Type-B is like company-like stations, and Type-C is regular stations.

Finally, by grouping stations into different groups, I can figure out suggestions to CitiBike company to better manually balance bike among stations.

9 Time-series Analysis of stations

Stations have different rush hours; therefore, it is worthwhile to treat them separately and perform time-series analysis to model their temporal profiles for possible forecasting.

Here I performed a time-series analysis and used an AR(4) model to fit the usage at a specific station.

First, prepare the observed usage at the station of interest.

s <- stations_type1_names[1]
sid <- dplyr::filter(stations, STATION_NAME == s) %>%
  select(STATION_ID) %>% as.numeric()
print(s)
## [1] "1 Ave & E 15 St"
# Month of interest
m <- 'Jan'

# station-date-hour-trips data.frame
# Note: some hour points have no activity at all, so they must be set to zero
pick_s_days_avail <- dplyr::filter(data,
                                   start.station.name %in% s,
                                   month(startTimestamp, label=T) == m) %>%
  mutate(DATE=date(startTimestamp)) %>%
  group_by(start.station.name, DATE, startHr) %>%
  summarise(TRIPS=n()) %>% as.data.frame()
colnames(pick_s_days_avail) <- c('STATION_NAME', 'DATE', 'HOUR', 'TRIPS')

pick_s_days_grid <- expand.grid(STATION_NAME=s,
                                DATE=unique(pick_s_days_avail$DATE),
                                HOUR=sprintf("%02d", 0:23),
                                TRIPS=0)
pick_s_days <- bind_rows(pick_s_days_avail, pick_s_days_grid) %>%
  group_by(STATION_NAME, DATE, HOUR) %>%
  summarise(NTRIPS=sum(TRIPS)) %>%
  mutate(TIMESTAMP=ymd_h(paste(DATE, HOUR), tz='EST5EDT'),
         YEAR=year(DATE))
obs_trips <- as.numeric(pick_s_days$NTRIPS)
ar4 <- ar(obs_trips, F, 4)
fit_trips <- fitted(ar4)
fit_trips[seq_len(4)] <- obs_trips[seq_len(4)]

pick_s_days_ar <- ungroup(pick_s_days) %>% dplyr::select(DATE, HOUR, YEAR) %>%
  mutate(OBS=obs_trips, FIT=fit_trips) %>%
  melt(id.vars=c('DATE', 'HOUR', 'YEAR'), value.name='NTRIPS',
       variable.name='Type')
p_pick_s_ar <- ggplot(pick_s_days_ar, aes(x=DATE, y=HOUR, fill=NTRIPS)) +
  geom_tile(color = "white", size = 0.4) +
  scale_fill_gradient(low="ghostwhite", high="red") +
  scale_x_date(date_breaks="1 week",  date_labels="%d-%b-%y") +
  facet_grid(YEAR~Type) +
  xlab('') + ylab('Hour') + labs(fill='Trips') +
  ggtitle(paste0('Observed and AR(4) fitted picking activity at ', s))
print(p_pick_s_ar)

By the AR(4) model, the temporal profile of the specific station is created. If needed, the CitiBike company can predict the future usage by this model.
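
A minimal forecasting sketch with predict() on a fitted stats::ar object follows; the series here is synthetic (standing in for hourly trip counts), whereas the real forecast would use the `ar4` object fitted above:

```r
set.seed(1)
# Synthetic hourly series standing in for station usage (not real trip data)
obs <- 10 + arima.sim(model = list(ar = c(0.5, -0.2)), n = 200)

# Same call shape as in the analysis: aic = FALSE fixes the order at 4
fit <- ar(obs, aic = FALSE, order.max = 4)

# Forecast the next 6 hours; fc$pred holds point forecasts, fc$se standard errors
fc <- predict(fit, n.ahead = 6)
```

The standard errors in `fc$se` widen with the horizon, so short-range (a few hours ahead) rebalancing decisions are the realistic use case.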

10 Community detection of stations

There is a directed weighted graph behind the raw data set. Each station can be considered a node, and each trip a directed edge. The number of trips between two stations gives the weight of the edge.

Therefore, it is suggested to run a community detection algorithm on this graph.

The algorithm used here is Infomap, which is built into the igraph package. The betweenness-based method is not used, as it is much more time-consuming.

10.1 Build directed weighted graph from data

net0_edges_wt <- group_by(data, start.station.name, end.station.name) %>%
  summarise(TRIPS=n()) %>%
  ungroup() %>% as.data.frame()
colnames(net0_edges_wt) <- c('from', 'to', 'weight')

stations <- dplyr::select(stations, STATION_NAME, STATION_ID,
                          STATION_LAT, STATION_LON)
net0 <- graph_from_data_frame(d=net0_edges_wt,
                              directed=T,
                              vertices=stations)
E(net0)$width <- E(net0)$weight / 10
E(net0)$arrow.size <- .2
E(net0)$edge.color <- "gray80"

10.2 Perform community detection on full graph

community_detect_stations <- function(net, nodes_geo,
                                      method=c('infomap', 'betweenness')){
  if (method == 'infomap'){
    imc <- cluster_infomap(net)
    memb <- membership(imc)
  } else if (method == 'betweenness') {
    ebc <- cluster_edge_betweenness(net)
    memb <- membership(ebc)
  } else {
    stop('Unknown method for community detection')
  }
  stations_community <- data.frame(STATION_NAME=names(memb),
                                   COMMUNITY=factor(memb))
  p_community <- left_join(x=stations_community, y=nodes_geo,
                           by=c('STATION_NAME')) %>%
    qmplot(data = ., x=STATION_LON, y=STATION_LAT,
           maptype = 'toner-lite',
           extent = 'device',
           zoom = 14,
           color=COMMUNITY, shape=COMMUNITY, size=I(2.5))
  return(p_community)
}
p_community_infomap_nyc <- community_detect_stations(net=net0,
                                                     nodes_geo=stations,
                                                     method='infomap')
print(p_community_infomap_nyc)

It is not surprising to see that, based on the full data set, New York City is partitioned into Manhattan and Brooklyn communities.

10.3 Community detection on only Type-A and Type-B stations

Type-A is home-like station and Type-B is company-like station.

net_stations_typeAnB <- induced.subgraph(net0, vids=c(stations_type1_names,
                                                    stations_type2_names))
p_community_infomap_typeAnB <- community_detect_stations(net=net_stations_typeAnB,
                                                         nodes_geo=stations,
                                                         method='infomap')
print(p_community_infomap_typeAnB)

In Manhattan, there are 2 big communities, generally separated by 23rd Street. That is to say, for example, Type-A stations around Penn Station are more likely to reach destinations on the East Side and will not head downtown, while Type-A stations around Avenue A & E 10 St are more likely to go toward the Wall Street area.

11 Model evaluation

There are two layers of evaluation:

  1. Evaluate whether the results are correct.
  2. Evaluate the performance of the method.

For the first issue, the answer is no. This project is mostly data-driven; although I started from the hypothesis that activity at stations reflects hidden patterns of human preference, the entire analysis is exploratory, and there is no gold standard for evaluating whether the findings are true. The usual argument about evaluating results from unsupervised machine learning applies to this project. However, exploratory data analysis does not mean there is no way to evaluate whether a method is appropriate, so the answer to the second issue is yes: throughout the project there are several sections that monitor performance.

For example, in order to infer an appropriate number of clusters when performing K-means on the 24-hour P/D index matrix, I used the silhouette metric, which is implemented in the NbClust package.

pd_station_K <- NbClust(data = pd_station_24hr, 
                        min.nc=3, max.nc=20, method='kmeans', index='silhouette')
plot_nbclust_obj <- function(nbclust_obj){
  nbclust_stats_k <- as.numeric(names(nbclust_obj$All.index))
  nbclust_stats_val <- as.numeric(nbclust_obj$All.index)
  nbclust_stats_val[is.na(nbclust_stats_val)] <- 0
  plot(x=nbclust_stats_k, y=nbclust_stats_val, type="b",
       xlab="Number of possible clusters",
       ylab='silhouette',
       main="Infer the best number of clusters")
}
plot_nbclust_obj(pd_station_K)

Because I would manually investigate the clustered heat-map to pick the groups defining Type-A, Type-B and Type-C stations, it is acceptable to manually set K to 10 at first. The resulting 3 types are consistent with the automatically inferred K of 3 from this model-evaluation step.

12 Visualization for NYC Taxi

save.and.print <- function (plot.func, file.name){
  pdf(file.path(FIGDIR, file.name), 10, 10)
  p <- plot.func()
  print(p)
  dev.off()
  print(plot.func())
}

12.1 Data Schema

print(summary(trip.data))
##     VendorID     lpep_pickup_datetime Lpep_dropoff_datetime
##  Min.   :1.000   Length:1404726       Length:1404726       
##  1st Qu.:2.000   Class :character     Class :character     
##  Median :2.000   Mode  :character     Mode  :character     
##  Mean   :1.795                                             
##  3rd Qu.:2.000                                             
##  Max.   :2.000                                             
##                                                            
##  Store_and_fwd_flag   RateCodeID     Pickup_longitude Pickup_latitude
##  Length:1404726     Min.   : 1.000   Min.   :-75.92   Min.   : 0.00  
##  Class :character   1st Qu.: 1.000   1st Qu.:-73.96   1st Qu.:40.69  
##  Mode  :character   Median : 1.000   Median :-73.95   Median :40.75  
##                     Mean   : 1.092   Mean   :-73.83   Mean   :40.69  
##                     3rd Qu.: 1.000   3rd Qu.:-73.92   3rd Qu.:40.80  
##                     Max.   :99.000   Max.   :  0.00   Max.   :42.32  
##                                                                      
##  Dropoff_longitude Dropoff_latitude Passenger_count Trip_distance    
##  Min.   :-75.92    Min.   : 0.00    Min.   :0.000   Min.   :  0.000  
##  1st Qu.:-73.97    1st Qu.:40.70    1st Qu.:1.000   1st Qu.:  1.070  
##  Median :-73.95    Median :40.75    Median :1.000   Median :  1.900  
##  Mean   :-73.85    Mean   :40.70    Mean   :1.359   Mean   :  2.879  
##  3rd Qu.:-73.91    3rd Qu.:40.79    3rd Qu.:1.000   3rd Qu.:  3.600  
##  Max.   :  0.00    Max.   :42.32    Max.   :9.000   Max.   :268.190  
##                                                                      
##   Fare_amount          Extra            MTA_tax          Tip_amount     
##  Min.   :-499.00   Min.   :-4.5000   Min.   :-0.5000   Min.   :-13.200  
##  1st Qu.:   6.50   1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.:  0.000  
##  Median :   9.50   Median : 0.5000   Median : 0.5000   Median :  0.000  
##  Mean   :  12.51   Mean   : 0.3502   Mean   : 0.4868   Mean   :  1.307  
##  3rd Qu.:  15.00   3rd Qu.: 0.5000   3rd Qu.: 0.5000   3rd Qu.:  2.000  
##  Max.   :3347.50   Max.   : 4.5000   Max.   : 0.5000   Max.   :300.080  
##                                                                         
##   Tolls_amount     Ehail_fee      improvement_surcharge  Total_amount    
##  Min.   :-5.5400   Mode:logical   Min.   :-0.3000       Min.   :-499.00  
##  1st Qu.: 0.0000   NA's:1404726   1st Qu.: 0.3000       1st Qu.:   8.19  
##  Median : 0.0000                  Median : 0.3000       Median :  11.76  
##  Mean   : 0.1191                  Mean   : 0.2921       Mean   :  15.09  
##  3rd Qu.: 0.0000                  3rd Qu.: 0.3000       3rd Qu.:  18.30  
##  Max.   :98.0000                  Max.   : 0.3000       Max.   :3349.30  
##                                                                          
##   Payment_type     Trip_type    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :1.000  
##  Mean   :1.515   Mean   :1.021  
##  3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :2.000  
##                  NA's   :2

12.2 Download Map Data For NYC

Download map data, with the bounding box automatically adjusted to the NYC Taxi data.

# Find outliers
remove.outliers <- function(x) x[!x %in% boxplot.stats(x)$out]
# Find min and max at the same time
min.and.max <- function(x) c(min(x), max(x))

# Find boundary
longitude.summary <- min.and.max(c(
  min.and.max(remove.outliers(trip.data$Pickup_longitude)),
  min.and.max(remove.outliers(trip.data$Dropoff_longitude)) 
  ))
latitude.summary <-  min.and.max(c(
  min.and.max(remove.outliers(trip.data$Pickup_latitude)),
  min.and.max(remove.outliers(trip.data$Dropoff_latitude)) 
  ))

12.2.1 Fetch Map

map <- ggmap::get_stamenmap(bbox = c(left = longitude.summary[1],
                                         bottom=latitude.summary[1],
                                         right= longitude.summary[2],
                                         top  = latitude.summary[2]
                                         ),
                                 zoom = 12,
                                 maptype = "toner-lite")

12.3 Sample data

Loading the whole data set would be too costly for finding an optimum value for k-means, so we first draw a sample.

sample.size <- 40000
sample.row.nums <- sample.int(length(row.names((trip.data))), sample.size)
trip.data.sample <- trip.data[sample.row.nums,]
trip.data.sample <- trip.data.sample[trip.data.sample$Pickup_longitude<0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Pickup_latitude >0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Dropoff_longitude<0,]
trip.data.sample <- trip.data.sample[trip.data.sample$Dropoff_latitude >0,]
taxi.sample.size <- length(row.names(trip.data.sample))
print(paste("Sample Size:", taxi.sample.size))
## [1] "Sample Size: 39911"

12.3.1 Pickup and Dropoff Visualization of Sample Data

Red dots mark dropoffs; blue dots mark pickups.

Taxi.Pickup.And.Dropoff.with.Sample.Data <- function(){
  ggmap::ggmap(map, main = "Taxi Pickup And Dropoff with Sample Data") + 
    ggplot2::geom_point(ggplot2::aes(x = Pickup_longitude, y = Pickup_latitude),
                        data = trip.data.sample,
                        shape = ".", color = "blue")+
    ggplot2::geom_point(ggplot2::aes(x = Dropoff_longitude, y = Dropoff_latitude),
                        data = trip.data.sample,
                        shape = ".", color = "red")
}
save.and.print(
  Taxi.Pickup.And.Dropoff.with.Sample.Data,
  'taxi_pickup_and_dropoff_sample.pdf')

12.3.2 Sample Clustering with k=100

First, just pick k = 100 to see a sample result.

t.k.s.d <- kmeans(trip.data.sample[,8:9],100, iter.max=100)
t.k.s.d$cluster <- as.factor(t.k.s.d$cluster)
Taxi.Cluster.Sample.with.k.100 <- function(){
  ggmap::ggmap(map, main = "Taxi Cluster Sample with k=100") + 
  ggplot2::geom_point(data = trip.data.sample, shape = ".",
                      ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = t.k.s.d$cluster)
                      )+
  ggplot2::geom_point(data = as.data.frame(t.k.s.d$centers), shape = 21,
                      ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
                                   fill = as.factor(row.names(t.k.s.d$centers)),
                                   size = as.numeric(t.k.s.d$size))
                      )+ theme(legend.position="none")
}
save.and.print(
  Taxi.Cluster.Sample.with.k.100, 
  "taxi_cluster_sample_with_k=100.pdf")

12.3.3 Find k within the dataset itself

Now let’s find an optimum value of k using the elbow method.

12.3.3.1 Get statistics for different k means

This is where the computation gets time-consuming, so we run it on the sample data set instead of the whole one.

taxi.k.numbers <- 1:80

taxi.collect.kmean.statistics <- function (k) {
  kmean.result <- kmeans(trip.data.sample[,8:9],centers = k, iter.max = max(100,k*2), algorithm="MacQueen")
  c( kmean.result$tot.withinss, kmean.result$betweenss )
}
taxi.k.means.statitcs <- plyr::laply(taxi.k.numbers, 
                                     taxi.collect.kmean.statistics,
                                     .progress= plyr::progress_text(char = '+')
                                     )

  |+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
# Set label
taxi.k.means.statitcs <- as.data.frame(taxi.k.means.statitcs)
colnames(taxi.k.means.statitcs) <- c("Withiness","Betweeness")
row.names(taxi.k.means.statitcs) <- taxi.k.numbers

# Transform to mean value instead of sum
taxi.k.means.statitcs$Withiness <- taxi.k.means.statitcs$Withiness/taxi.sample.size
taxi.k.means.statitcs$Betweeness <- taxi.k.means.statitcs$Betweeness/taxi.sample.size

12.3.3.2 Between-cluster and within-cluster sums of squares

Taxi.Withingness.And.Betweeness.by.different.K <- function(){
  k <- as.numeric(row.names(taxi.k.means.statitcs))
  ggplot2::ggplot(taxi.k.means.statitcs, main = "Taxi Withingness And Betweeness by different K")+
    ggplot2::geom_line(ggplot2::aes(x = k, y = Withiness), color = "blue")+
    ggplot2::geom_line(ggplot2::aes(x = k, y = Betweeness), color = "red")
}
save.and.print(
  Taxi.Withingness.And.Betweeness.by.different.K,
  "Taxi.Withingness.And.Betweeness.by.different.K.pdf"
)

Red shows the between-cluster sum of squares; blue shows the within-cluster sum of squares.

12.3.3.3 A Closer look

Taxi.Withingness.And.Betweeness.by.different.K.Closer <- function(){
  closer <- taxi.k.means.statitcs[3:30,]
  k <- as.numeric(row.names(closer))
  ggplot2::ggplot(closer, main = "Taxi Withingness And Betweeness by different K - Closer")+
    ggplot2::geom_line(ggplot2::aes(x = k, y = Withiness), color = "blue")+
    ggplot2::geom_line(ggplot2::aes(x = k, y = Betweeness), color = "red")
}
save.and.print(
  Taxi.Withingness.And.Betweeness.by.different.K.Closer,
  "Taxi.Withingness.And.Betweeness.by.different.K.closer.pdf"
)

Red shows the between-cluster sum of squares; blue shows the within-cluster sum of squares. From the elbow, we pick k = 7.

13 Cluster on whole Taxi dataset

k=7
t.k.s.d <- kmeans(trip.data[,8:9],k, iter.max=200, algorithm="MacQueen")

t.k.s.d$cluster <- as.factor(t.k.s.d$cluster)
Taxi.Need.Clustering <- function(){
  ggmap::ggmap(map, main = "Taxi Need Clustering") + 
    ggplot2::geom_point(data = trip.data, shape = ".",
                        ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = as.factor(t.k.s.d$cluster))
                        )+
    ggplot2::geom_point(data = as.data.frame(t.k.s.d$centers), shape = 21,
                        ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
                                     fill = as.factor(row.names(t.k.s.d$centers)),
                                     size = as.numeric(t.k.s.d$size))
                        ) + theme(legend.position="none")
}
save.and.print(Taxi.Need.Clustering, "Taxi.Need.Clustering.pdf")

From the graph we can see that the clustering seems meaningful, but it is not suitable for predicting potential CitiBike stations: it is too coarse. So what do we do?

14 Cluster on features extracted from CitiBike

14.1 Why?

Looking back, consider what made CitiBike place all of its stations the way it did:

print(p_all_stations)

Specifically, why are the stations placed so close to each other? The answer is simple: people walk to nearby stations to get bikes. In other words, walking distance is very limited. So the criterion the taxi data should satisfy when adopting a value of k can be inferred from the distances between the stations that already exist.

14.2 MCD of Stations

I’ve developed a metric for measuring the distance between stations: the Mean squared Closest Distance (MCD).

# MCD
MCD <- function(points){
  # Calculate Distance Matrix
  distance.matrix <- function(points, func){
    as.matrix(apply(points, 1, function(p1) apply(points, 1, function(p2) func(p1, p2))))
  }
  point.distances <- distance.matrix(points, func =
                                       function(x, y) (x[1] - y[1])^2 + (x[2] - y[2])^2
                                     )
  
  # Find Nearest Distance
  min.no <- function(A) min(A[A>0])
  point.closest.distance <- apply(point.distances, 1, min.no)
  
  # Mean
  mean.point.closest.distance <- mean(point.closest.distance)
  mean.point.closest.distance
}

mean.station.closest.distance <- MCD(stations[,c('STATION_LON','STATION_LAT')])
print(paste("MCD of Stations:", mean.station.closest.distance))
## [1] "MCD of Stations: 5.56514561189345e-06"
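As a quick sanity check (a toy example, not part of the original analysis; it assumes the MCD function defined above), three points at (0, 0), (0, 1) and (3, 0) have nearest squared distances 1, 1 and 9, so the MCD should be 11/3:

```r
# Toy check of MCD: nearest squared distances are 1, 1 and 9; mean = 11/3
toy.points <- data.frame(x = c(0, 0, 3), y = c(0, 1, 0))
MCD(toy.points)  # ~3.6667
```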

14.3 MCD over k on Taxi Data

taxi.max.k <- 500
taxi.collect.kmean.statistics.MCD <- function (k) {
  kmeans.result <- kmeans(trip.data.sample[,8:9],centers = k, iter.max = max(100,k*2), algorithm="MacQueen")
  MCD(kmeans.result$centers)
}
taxi.k.means.statitcs.MCD <- plyr::laply(1:taxi.max.k, 
                                     taxi.collect.kmean.statistics.MCD,
                                     .progress= plyr::progress_text(char = '+')
                                     )
## 
  |+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
taxi.k.means.statitcs.MCD<- data.frame(taxi.k.means.statitcs.MCD)
taxi.k.means.statitcs.MCD$k <- 1:taxi.max.k
colnames(taxi.k.means.statitcs.MCD) <- c("MCD","k")
Taxi.MCD.by.different.K <- function(){
  k <- as.numeric(row.names(taxi.k.means.statitcs.MCD))
  ggplot2::ggplot(taxi.k.means.statitcs.MCD, main = "Taxi MCD by different K")+
    ggplot2::geom_line(ggplot2::aes(x = k, y = MCD))
}
save.and.print(
  Taxi.MCD.by.different.K,
  "Taxi.MCD.by.different.K.pdf"
)

14.3.1 Best k

taxi.k.means.statitcs.MCD[taxi.k.means.statitcs.MCD$MCD<mean.station.closest.distance,]
## [1] MCD k  
## <0 rows> (or 0-length row.names)
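Since the table above is empty, a hedged follow-up (a sketch assuming the taxi.k.means.statitcs.MCD and mean.station.closest.distance objects computed above; not in the original report) is to inspect how close the taxi clustering ever gets to the station-level MCD:

```r
# How close does any k get to the station-level MCD?
gap <- taxi.k.means.statitcs.MCD$MCD - mean.station.closest.distance
taxi.k.means.statitcs.MCD[which.min(gap), ]
```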

14.4 There should be stations here

Since no k up to 500 brings the taxi MCD below the MCD of the existing stations, we pick k = 300.

k=300
all.stations.should.kmeans <- kmeans(trip.data[,8:9],k, iter.max=200, algorithm="MacQueen")

all.stations.should.kmeans$cluster <- as.factor(all.stations.should.kmeans$cluster)
Stations.should.be.here <- function(){
  ggmap::ggmap(map, main = "Stations should be here") + 
    ggplot2::geom_point(data = trip.data, shape = ".",
                        ggplot2::aes(Dropoff_longitude, Dropoff_latitude, color = as.factor(all.stations.should.kmeans$cluster))
                        )+
    ggplot2::geom_point(data = as.data.frame(all.stations.should.kmeans$centers), shape = 21,
                        ggplot2::aes(Dropoff_longitude, Dropoff_latitude,
                                     fill = as.factor(row.names(all.stations.should.kmeans$centers)),
                                     size = as.numeric(all.stations.should.kmeans$size))
                        ) + theme(legend.position="none")
}
save.and.print(Stations.should.be.here, "Stations.should.be.here.pdf")


15 Final Conclusions

Human activities cannot be observed directly, but the picking and docking activities at stations provide indirect yet robust data for investigating the hidden patterns.

First, the data visualization part suggests that there are hidden patterns in how people use CitiBike. Second, the hypothesis is supported by the temporal profiles of about 300 stations. The temporal profiles are based on the P/D index, a novel metric that merges picking and docking information and provides a straightforward basis for analysis. In addition, we modeled the “rush-hour” properties of a specific station as an example, to suggest a time schedule for the CitiBike company to manually re-balance bikes at that station. Finally, we identified communities of stations with the Infomap algorithm; the communities are roughly consistent with the administrative regions around New York City, which suggests our findings are robust. New CitiBike stations should be added, since there is substantial demand and only part of it is fulfilled.


16 Further work

Future work is to find whether there are stations whose temporal profile changes after some year. For example, station-X during 2013-2015 is a Type-A station, i.e. a home-like station, but all of a sudden, since 2016, its temporal profile changed to Type-C, i.e. stable throughout the day. That would suggest that either the neighborhood changed entirely or a major transportation hub changed. The point is that, without directly observing the hidden patterns of human activities, we still have an indirect approach for capturing such changes. An ordering of new stations by priority could also be derived; for this, a measure of revenue is needed, which becomes possible once an estimate of the income from each station is available.
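The change-detection idea above could be sketched as follows (a hypothetical illustration with made-up profiles and an assumed threshold; the station types come from the discussion, but nothing here is computed from the actual data):

```r
# Flag a station whose 24-hour P/D profile changed between two years,
# using the correlation of the two hourly profiles (threshold is an assumption)
profile.changed <- function(profile.a, profile.b, threshold = 0.5) {
  cor(profile.a, profile.b) < threshold
}

hours <- 0:23
# Toy Type-A ("home-like") profile: activity peaks in the evening
profile.2015 <- dnorm(hours, mean = 18, sd = 2)
# Toy profile after the hypothetical change: activity peaks in the morning
profile.2016 <- dnorm(hours, mean = 8, sd = 2)

profile.changed(profile.2015, profile.2016)  # TRUE: the profile shifted
```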

17 Contributions

Yun Yan

Willian Zhang

18 Codes

https://github.com/Puriney/ineedabike


19 Clarification

All three proposed parts of work described in the Project Proposal have been finished, as has the optional part (network analysis). On presentation day, only the first part (data visualization) and the second part (subgroups of stations) were introduced; the third and final part (time-series analysis and community detection) were not, because we ran out of presentation time. The same content in this project report can also be found in the submitted presentation slide file.


20 Session-info

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.1 (Sierra)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NbClust_3.0       igraph_1.0.1      ggmap_2.6.1      
##  [4] pheatmap_1.0.8    forecast_7.2      timeDate_3012.100
##  [7] zoo_1.7-13        cowplot_0.7.0     scales_0.4.0     
## [10] readr_1.0.0       lubridate_1.6.0   ggplot2_2.1.0    
## [13] reshape2_1.4.1    dplyr_0.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7        formatR_1.4        RColorBrewer_1.1-2
##  [4] plyr_1.8.4         tseries_0.10-35    tools_3.3.1       
##  [7] digest_0.6.10      evaluate_0.9       tibble_1.2        
## [10] gtable_0.2.0       lattice_0.20-33    png_0.1-7         
## [13] DBI_0.5-1          mapproj_1.2-4      yaml_2.1.13       
## [16] parallel_3.3.1     proto_0.3-10       stringr_1.1.0     
## [19] knitr_1.14         maps_3.1.1         RgoogleMaps_1.4.1 
## [22] grid_3.3.1         nnet_7.3-12        R6_2.2.0          
## [25] jpeg_0.1-8         rmarkdown_1.1      sp_1.2-3          
## [28] magrittr_1.5       MASS_7.3-45        htmltools_0.3.5   
## [31] assertthat_0.1     geosphere_1.5-5    colorspace_1.2-6  
## [34] fracdiff_1.4-2     labeling_0.3       quadprog_1.5-5    
## [37] stringi_1.1.1      lazyeval_0.2.0     munsell_0.4.3     
## [40] rjson_0.2.15